{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Kaggle Amazon"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В этом туториале показана основная функциональность библиотеки CatBoost с использованием датасета Amazon из соревнования на [Kaggle](https://www.kaggle.com).\n",
    "Данные можно скачать [здесь](https://www.kaggle.com/c/amazon-employee-access-challenge/data) (для этого надо создать свой аккаунт на Kaggle)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Чтение данных"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "train_df = pd.read_csv('amazon/train.csv')\n",
    "test_df = pd.read_csv('amazon/test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Подготовка датасета"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Выделение целевой переменной"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = train_df.ACTION\n",
    "X = train_df.drop('ACTION', axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Объявление категориальных факторов"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cat_features = range(0, X.shape[1])\n",
    "print cat_features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Обучение модели"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Разделение данных на train и validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.8, random_state=1234)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Обучение модели"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from catboost import CatBoostClassifier\n",
    "model = CatBoostClassifier(\n",
    "    thread_count=2,\n",
    "    iterations=5,\n",
    "    random_seed=1136926949945377,\n",
    ")\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    logging_level='Silent'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print model.random_seed_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from catboost import CatBoostClassifier\n",
    "model = CatBoostClassifier(\n",
    "    thread_count=2,\n",
    "    iterations=300,\n",
    "    learning_rate=0.1,\n",
    "    random_seed=63,\n",
    "    custom_loss=['AUC', 'Accuracy'],\n",
    "    use_best_model=True\n",
    ")\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print 'Model is fitted:', model.is_fitted()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print 'Model params:', model.get_params()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print 'Resulting tree count:', model.tree_count_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Способы задания датасета"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost.utils import create_cd\n",
    "import os\n",
    "\n",
    "feature_names = dict()\n",
    "for column, name in enumerate(train_df):\n",
    "    if column == 0:\n",
    "        continue\n",
    "    feature_names[column - 1] = name\n",
    "    \n",
    "create_cd(\n",
    "    label=0, \n",
    "    cat_features=list(range(1, train_df.columns.shape[0])),\n",
    "    feature_names=feature_names,\n",
    "    output_path='amazon/test.cd'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat amazon/test.cd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from catboost import Pool\n",
    "pool1 = Pool(data=X, label=y, cat_features=cat_features)\n",
    "pool2 = Pool(data='amazon/train.csv', delimiter=',', has_header=True, column_description='amazon/test.cd', thread_count=2)\n",
    "\n",
    "print 'Dataset shape'\n",
    "\n",
    "print 'dataset 1:', pool1.shape, '\\ndataset 2:', pool2.shape\n",
    "\n",
    "print\n",
    "print 'Column names'\n",
    "print 'dataset 1:', pool1.get_feature_names(), '\\ndataset 2:',  pool1.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Кросс-валидация"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from catboost import cv\n",
    "\n",
    "params = model.get_params()\n",
    "params['iterations'] = 4\n",
    "params['custom_loss'] = 'AUC'\n",
    "del params['use_best_model']\n",
    "\n",
    "cv_data = cv(\n",
    "    params = params,\n",
    "    pool = Pool(X, label=y, cat_features=cat_features),\n",
    "    fold_count=2,\n",
    "    type = 'Classical',\n",
    "    shuffle=True,\n",
    "    partition_random_seed=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "np.set_printoptions(precision=3)\n",
    "\n",
    "for name, values in cv_data.iteritems():\n",
    "    print name + ':'\n",
    "    print np.array(values)\n",
    "    print '\\n'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "best_value = np.max(cv_data['test-AUC-mean'])\n",
    "best_iter = np.argmax(cv_data['test-AUC-mean'])\n",
    "print 'Best validation AUC score: {:.2f}±{:.2f} on step {}'.format(\n",
    "    best_value,\n",
    "    cv_data['test-AUC-std'][best_iter],\n",
    "    best_iter\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Подбор параметров"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "best_model = CatBoostClassifier(\n",
    "    iterations=1500,\n",
    "    learning_rate=0.01,\n",
    "    l2_leaf_reg=3,\n",
    "    bagging_temperature=1,\n",
    "    random_strength=1,\n",
    "    one_hot_max_size=0,\n",
    "    random_seed=63,\n",
    "    use_best_model=True\n",
    ")\n",
    "best_model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Детектор переобучения"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model_full = CatBoostClassifier(\n",
    "    eval_metric='AUC',\n",
    "    learning_rate=0.8,\n",
    "    iterations=500,\n",
    "    random_seed=42\n",
    ")\n",
    "model_full.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model_with_earlystop = CatBoostClassifier(\n",
    "    eval_metric='AUC',\n",
    "    learning_rate=0.8,\n",
    "    iterations=500,\n",
    "    random_seed=42,\n",
    "    od_type='Iter',\n",
    "    od_wait=20\n",
    ")\n",
    "\n",
    "model_with_earlystop.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model_with_trunk = CatBoostClassifier(\n",
    "    eval_metric='AUC',\n",
    "    learning_rate=0.8,\n",
    "    iterations=500,\n",
    "    random_seed=42,\n",
    "    od_type='Iter',\n",
    "    od_wait=20,\n",
    "    use_best_model=True\n",
    ")\n",
    "\n",
    "model_with_trunk.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Silent',\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print 'Full model tree count:', model_full.tree_count_\n",
    "print 'Early-stopped model tree count:', model_with_earlystop.tree_count_\n",
    "print 'Trunkated model tree count:', model_with_trunk.tree_count_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Сравнение нескольких моделей"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model1 = CatBoostClassifier(\n",
    "    learning_rate=0.9,\n",
    "    iterations=100,\n",
    "    train_dir='learing_rate_0.9',\n",
    "    name='learing_rate_0.9'\n",
    ")\n",
    "\n",
    "model2 = CatBoostClassifier(\n",
    "    learning_rate=0.1,\n",
    "    iterations=100,\n",
    "    train_dir='learing_rate_0.1',\n",
    "    name='learning_rate_0.1'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "model1.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Verbose'\n",
    ")\n",
    "model2.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Verbose'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from catboost import MetricVisualizer\n",
    "widget = MetricVisualizer(['learing_rate_0.9', 'learing_rate_0.1'])\n",
    "widget.start()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Снепшоты"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(\n",
    "    iterations=40,\n",
    "    save_snapshot=True,\n",
    "    snapshot_file='snapshot.bkp',\n",
    "    random_seed=43\n",
    ")\n",
    "model.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Verbose'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Предсказание формулы"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print model.predict_proba(pool1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "predictions = list(model.staged_predict_proba(pool1, ntree_start=0, ntree_end=0, eval_period=1, thread_count=2))\n",
    "print predictions[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Вычисление метрик на новом датасете"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tree_count = model.tree_count_\n",
    "metrics = model.eval_metrics(pool1, metrics=['Logloss','AUC','Accuracy'], ntree_start=0, ntree_end=0,\n",
    "                             eval_period=tree_count, thread_count=2)\n",
    "auc = metrics['AUC']\n",
    "print auc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Важность факторов"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Найдем самые важные факторы\n",
    "importances = best_model.feature_importances_\n",
    "print 'Feature importances:', np.array(importances)\n",
    "print 'Feature names:', np.array(pool1.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Сохранение модели"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "my_best_model = CatBoostClassifier(iterations=10)\n",
    "my_best_model.fit(\n",
    "    X_train, y_train,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    cat_features=cat_features,\n",
    "    logging_level='Verbose'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "my_best_model.save_model('catboost_model.bin')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "my_best_model.load_model('catboost_model.bin')\n",
    "print my_best_model.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Теперь у вас есть время, чтобы натренировать лучшую модель. Ее результаты вы будете отправлять на kaggle"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "widgets": {
   "state": {
    "1057714ebc614324aa3ba2cf69408966": {
     "views": [
      {
       "cell_index": 17
      }
     ]
    },
    "8381e9eed05f4a03905ae8a56d7ab4ea": {
     "views": [
      {
       "cell_index": 48
      }
     ]
    },
    "f49684e8c5c44241bfe2c7f577f5cb41": {
     "views": [
      {
       "cell_index": 53
      }
     ]
    }
   },
   "version": "1.2.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
